Genomic analysis of SARS-CoV-2 variants of concern circulating in Hawai’i to facilitate public-health policies

Using genomics, bioinformatics and statistics, herein we demonstrate the effect of statewide and nationwide quarantine on the introduction of SARS-CoV-2 variants of concern (VOC) in Hawai’i. To define the origins of introduced VOC, we analyzed 260 VOC sequences from Hawai’i, and 301,646 VOC sequences worldwide, deposited in the GenBank and global initiative on sharing all influenza data (GISAID), and constructed phylogenetic trees. The trees define the most recent common ancestor as the origin. Further, the multiple sequence alignment used to generate the phylogenetic trees identified the consensus single nucleotide polymorphisms in the VOC genomes. These consensus sequences allow for VOC comparison and identification of mutations of interest in relation to viral immune evasion and host immune activation. Of note is the P71L substitution within the E protein, the protein sensed by TLR2 to produce cytokines, found in the B.1.351 VOC may diminish the efficacy of some vaccines. Based on the phylogenetic trees, the B.1.1.7, B.1.351, B.1.427, and B.1.429 VOC have been introduced in Hawai’i multiple times since December 2020 from several definable geographic regions. From the first worldwide report of VOC in GenBank and GISAID, to the first arrival of VOC in Hawai’i, averages 320 days with quarantine, and 132 days without quarantine. As such, the effect of quarantine is shown to significantly affect the time to arrival of VOC in Hawai’i. Further, the collective 2020 quarantine of 43-states in the United States demonstrates a profound impact in delaying the arrival of VOC in states that did not practice quarantine, such as Utah. Our data demonstrates that at least 76% of all definable SARS-CoV-2 VOC have entered Hawai’i from California, with the B.1.351 variant in Hawai’i originating exclusively from the United Kingdom. These data provide a foundation for policy-makers and public-health officials to apply precision public health genomics to real-world policies such as mandatory screening and quarantine.


Introduction
Hawai'i has experienced unique epidemics within the coronavirus disease 2019 (COVID- 19) pandemic, in that Pacific Islanders, which account for 4% of the population, once accounted for nearly 30% of COVID-19 cases [1]. Further, the Japanese population of Hawai'i currently accounts for 6% of the population and experiences 15% of COVID-19 cases. White persons, in contrast, account for 37% of the population and 25% of the cases [2]. As such, a heightened need exists to understand SARS-CoV-2 introduction into Hawai'i and the effect of public policy measures. Early in the pandemic, in an attempt to control the spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), Hawai'i, like 42 other states in the United States, implemented a quarantine defined by "Stay-at-Home" orders. State-at-Home orders directed residents to stay inside homes except for essential needs and closed operations of non-essential businesses [3]. In addition to this public policy, more than 22,300 SARS-CoV-2 sequences submitted to GISAID and GenBank originate from Hawai'i to facilitate further studies.
The Pangolin/Phylogenetic Assignment of Named Global Outbreak (PANGO) Lineage nomenclature system developed by Rambaut and colleagues [4] for SARS-CoV-2 lineages has allowed for manageable and efficient partitioning of Hawai'i and worldwide sequences for the rapid determination of SARS-CoV-2 origin. The partitioning converts the vast number of sequences into smaller collections of pre-defined similar sequences. These can further generate multiple sequence alignments (MSA) to produce phylogenetic trees efficiently and at low cost.
Using the CDC-classified SARS-CoV-2 VOC (B.1.1.7, B.1.351, B.1.427, B.1.429, and P.1), identified in Hawai'i [5][6][7] as an example, we demonstrate a method to define the origin of SARS-CoV-2 lineages and VOC. This method works using either open-source or licensed software with either a personal computer or a supercomputer. Additionally, we evaluate the effect of quarantine in delaying the arrival of VOC. Using these methods and associated analysis to define the origin of introduction of VOC and determine the impact of quarantine, publichealth officials can develop evidence-based policies to curtail the spread of VOC.

Defining the origin of SARS-CoV-2 VOC
Here, we present a method to define SARS-CoV-2 VOC origin. The architecture of this analysis is outlined in Fig 1 ( April 2, 2021, in the GenBank and global initiative on sharing all influenza data (GISAID) as an example. In brief, the method includes: 1) The lineage-defining sequences of SARS-CoV-2 Lineage A and Lineage B act as the most ancestral roots [4]. Lineage A (EPI_ISL_406801) is from GISAID, and Lineage B (MN908947) is from GenBank. 2) To identify lineages of interest in an area, filter GISAID by location (e.g.: North America/USA/ Hawai'i) and download all sequences. For VOC with >10,000 sequences, GISAID sequences were downloaded in batches due to GISAID maximum download size. Similarly, all geographically similar sequences reported in GenBank were downloaded using the search term SARS-CoV-2 and state abbreviation (e.g., "SARS-CoV-2 HI") and the sequence length filter (20,000). 3) Combine the GISAID and GenBank sequences into one.fasta file using AliView, Geneious Prime 2021.0.3 (http://www.geneious.com), or a text editor, and assign lineages using Pangolin Lineage Assigner (pangolin.cog-uk.io) [4,[8][9][10]. 4) Download the results to Microsoft Excel, use advanced filter to copy unique records of lineages to a new column (ex: column M), then use COUNTIF (e.g., = COUNTIF($B$2:$B$1432,M2)) to determine prevalence of each lineage. Alternatively, upload the results to Google Sheets and use the = UNIQUE command (e.g., = UNIQUE(B2:B1432) followed by the above COUNTIF command. 5) Identify sequences that are the lineage of interest (e.g.: B.1.429), from using Pangolin Lineage Assigner, and download those sequences individually from GenBank. Filter GISAID by the lineage of interest (e.g., B.1.429) and download all sequences. 6) Combine lineage of interest (B.1.429) GenBank sequences, GISAID (B.1.429) sequences, and EPI_ISL_406801 into one fasta file. 7) Align sequences using multiple alignment using fast Fourier transform (MAFFT) program or server (https://mafft.cbrc.jp/alignment/server/add_fragments.html? frommanualnov6) [11][12][13] with MN908947 as a reference and do not remove any uninformative sequences and all parameters set as "same as input." 8) Remove the newly added MN908947 sequence that MAFFT places at the beginning of the alignment using AliView, Geneious Prime, or a text editor. If not, the sRNA toolbox will remove the MN908947 sequence during the duplicate removal step, and Lineage B will not serve as an ancestral root in the phylogenetic tree. 9) Import MSA file into Geneious Prime or AliView [10], search for the orf1a 5' start of the entire alignment (5'-atggagagccttgtccctggtttca-3') and remove the 5' untranslated region (UTR) by deleting the upstream region (~265 bp) from the MSA. Next, search for ORF10 3' end (5'-tgtagttaactttaatctcacatag-3') and remove the entire 3' UTR by deleting the downstream region (~229 bp) from the MSA. 10) Create a duplicate file for the MN908947 sequence and remove the 5' UTR and 3' UTR from MN908947 as described above. 11) Using MAFFT, align the trimmed MSA with the trimmed MN908947 as a reference and delete sequences with uncalled nucleotides 'n'. Set the "remove  ). Collectively, the workflow defines the VOC of a particular area, so research may identify a VOC of interest to conduct the workflow. Then all worldwide sequences of the selected VOC are aggregated into a Multiple Sequence Alignment (MSA). The MSA is modified to remove ambiguous and uninformative sequences and added to the ancestral lineages A (GISAID Accession # EPI_ISL_406801) and B (GenBank Accession # MN908947). Duplicate sequences are merged via their accession number (creating appendages), and the duplicates are removed from the MSA. An approximately maximum-likelihood phylogenetic tree is generated and rooted with ancestral lineage A. The identification (ID) of the most recent common ancestor (MRCA) of the sequences of interest defines the origin. Maps generated with the open-source "usmap" and"ggplot2" packages and R under GPL-3 and MIT+ license and are free to use (https://cran.r-project. org/web/packages/usmap/usmap.pdf)(https://cran.r-project.org/web/packages/ggplot2/index.html).
https://doi.org/10.1371/journal.pone.0278287.g001 uninformative sequences" parameter in the MAFFT at >0%. 12) Using sRNAtoolbox program or server (https://arn.ugr.es/srnatoolbox/helper/removedup/) [14], load the updated alignment to remove duplicate sequences and merge identifications (also referred to as sequence accession numbers) of duplicates. This merger will create "appendages" in the phylogenetic tree where the sRNA toolbox will line up identical sequences together with equal signs (=). 13) Import the final alignment into Geneious Prime and create an approximately maximum-likelihood phylogenetic tree using the FastTree program [15]. Alternatively, FastTree can run as standalone software, and FastTreeMP is appropriate when multiple CPU cores/threads are available. 14) Root the tree with Lineage A (EPI_ISL_406801), which should then be the most recent common ancestor (MRCA) to Lineage B (MN908947) if performing phylogenetics on a Lineage B subgroup. Identify the MRCA of each sequence of interest.
For the B.1.1.7 VOC, which began with 272,732 sequences, we partitioned the sequences into seven sub-MSAs of~50,000 sequences and performed the above method on each sub-MSA. After unambiguous sequences and duplicates were removed from each group, sub-MSA were recombined using AliView [10]. Duplicates were removed after each recombination of two sub-MSA, except for the final MSA due to size restrictions. Four sub-MSA were combined to create the final MSA. All sequences in the final MSA are unambiguous. However, there are likely duplicate sequences present. FastTreeMP generated the phylogenetic tree for B.

Identifying the consensus of each VOC
To identify consensus SNPs of SARS-CoV-2 VOCs, assign Lineage B as the reference sequence, and use the Geneious Prime "Find Variations/SNPs" Annotate and Predict function to identify consensus SNPs. Input SNPs into the SnapGene (Insightful Science, snapgene.com) to identify the nucleotide and amino acid number and substitution as described previously [16].

Evaluating the effect of quarantine
To test the hypothesis that the 67-day (2020-03-25 to 2020-05-31) [17] quarantine in Hawai'i, and the collective 43-state quarantine in the United States that occured from 2020-03-11 to 2020-06-16, significantly delayed the arrival of VOC, we partitioned the VOC into two categories. The first category of "quarantine" are those VOC (B.  May 20, 2021. For each VOC, we calculated the days between the first worldwide report and each of the first ten genetically distinct strains of VOC introduced in Hawai'i. We then compared the quarantine group to the post-quarantine group by days-to-arrival using an independent t-test in RStudio version 1.3.1093 (R version 4.0.3) and plotted with ggplot2 and ggstatsplot packages [18][19][20]. Further, we evaluated one of the seven states (AR, IA, ND, NE, SD, UT, WY) [21] that did not participate in the collective 43-state quarantine to determine if the delay in the arrival of VOC between quarantine and non-quarantine time periods exists for a state that did not quarantine. The nearest geographic state to Hawai'i of the remaining seven states which did not quarantine is Utah. Therefore, we evaluated Utah's first ten genetically distinct reports of each of the five VOC (Utah only had three strains of B.1.351 VOC, with two being unique (total 42)), as mentioned above, and compared them to the groups from Hawai'i.

Results
As of April 02, 2021, 43 unique lineages were identified with Pangolin Lineage Assigner from the 1,431 total sequences deposited in the GenBank and GISAID from Hawai'i (

Variants of concern consensus
Fig 2 shows the phylogenetic analysis of all VOC prevalent in Hawai'i reported worldwide rooted with the Lineage A reference sequence (EPI_ISL_406801) [4]. These trees were generated using FastTree in MANA HPC and Geneious Prime [15]. Based on this analysis, 228 of the 260 (87.69%) VOC found in Hawai'i have identifiable origins.  (Table 2).

B.1.429 California VOC
In the GenBank, as of April 02, 2021, 21 of 97 Hawai'i sequences were of lineage B.1.429 as determined with the Pangolin Lineage Assigner. One-hundred eighty-six sequences in GISAID from Hawai'i were identified as the B.1.429 VOC. Total B.1.429 VOC reported worldwide in GISAID was 15,393 as determined by applying the GISAID lineage filter. Thus, the starting sequence count was 15,416 (21 GenBank + 15,393 GISAID + 2 lineage origin = 15,416). Of the 15,416 sequences, 11,648 sequences were removed for being uninformative and containing incomplete sequences. Further, the sRNAtoolbox server removed 944 sequences containing duplicate sequences and 15 sequences of duplicate ID. The final alignment of 2,809 (15,416-11,648 ambiguous-959 duplicate sequence and ID = 2,809) strains and subsequent phylogenetic analysis defined the origin of the B.1.429 variant introduced into Hawai'i ( Fig  2A). Using this method, we were able to identify the origin of 183 of 207 B.1.429 sequences introduced into Hawai'i (Fig 3).

B.1.427 California VOC
In the GenBank, as of April 02, 2021, 8 of 97 Hawai'i sequences were B.1.427 VOC as determined with the Pangolin Lineage Assigner. Twenty-six sequences in GISAID from Hawai'i were identified as the B.1.427 VOC. Total B.1.427 lineages reported worldwide in GISAID were 6,562 as determined by applying the GISAID lineage filter. Thus, the starting sequence    2C). Using this method, we were able to identify the origin of 10 of 14 B.1.1.7 sequences introduced into Hawai'i (Fig 3).

Discussion
Precision public health genomics has been a tool in past outbreaks that has yet to be applied for the COVID-19 pandemic. These data and the method serve as a foundation for policymakers to apply precision public health genomics tools by discerning trends related to the source of SARS-CoV-2 introductions. By identifying the origin of SARS-CoV-2, policies can be reasonably constructed with evidence-based decisions. The origins of variants of concern in Hawai'i

Defining VOC using genetic characterization
As an effect of performing the origin defining method, described in the methods section, the MSA of all unambiguous VOC genomes can characterize the genome of VOC. Additionally, with this method, we can robustly compare the genomic similarities and differences between VOC. Within the S gene of the B.  [30]. The envelope protein was recently shown to interact with TLR2 and initiate inflammatory response [31]. Prolines are known to be involved in beta-turns, and the P71L substitution could significantly change the secondary and tertiary protein structures. This proline loss is striking for vaccines, since some vaccines effective against wild-type SARS-CoV-2, have diminished efficacy against the B.1.351 variant [32,33].
Efforts to understand variants still focus primarily on identifying the effect of individual mutations and substitutions. Many of the other mutations have yet to be evaluated experimentally, either individually or in concert with the D614G substitution. The ubiquitous D614G substitution increases the fitness of SARS-CoV-2, even at the cost of increased susceptibility to neutralization [34][35][36]. As such, SARS-CoV-2 has been evolving, and substitutions in the B.1.351 VOC and P.2 variant of interest (VOI), such as the E484K, are shown to confer resistance to neutralization and exhibit 50% increased transmission [37,38]. Recent publications focused on L452R mutation demonstrate significant decrease in the effectiveness of mAb treatments and neutralization by convalescent and vaccine sera [39,40]. Other mutated genes within VOC are orf1ab, ORF3a, M, ORF8, and N, as well as two introns. Mutations and substitutions in these genes and proteins, annotated in Table 2, are enigmatic, and warrant further studies. Conclusively, what is certain is that tracking the spread of these VOC, and determining the effects of their substitutions, is paramount in the effort to control the pandemic.

Ambiguous sequences
This method demonstrates the need for high-quality sequencing and the need for enrichment and deep coverage. For example, the first B.1.429 sequence deposited from Hawai'i was from a sample collected on December 31, 2020 (EPI_ISL_967766) is presently unusable. Without resequencing the whole genome or filling in with Sanger sequencing, this sequence is currently not of use in phylogenetics and origin determination due to ambiguous nucleotides in the S gene. While uninformative sequences may be useful for tracking the emergence of individual mutations, they are not useful in tracking VOC.

Public policy recommendations and impacts
Precision public health genomics is a public health policy tool to track the spread of viruses. In the age of fast-evolving digital information, precision public health genomics became prominent during the West Africa Ebola outbreak from 2014-2016 [41]. This tool has not been efficiently and effectively used during the COVID-19 pandemic due to the overwhelming worldwide sequencing effort. A testament to this effort is the deposition of over 18 million SARS-CoV-2 WGS in the GenBank and GISAID since January 2020. For precision public health genomics to be effective during the COVID-19 pandemic, high-throughput sequencing and high-speed, low-cost sequence data analysis, and robust phylogenetics are necessary. Also, a fast, effective, consistent, and economical method is required to analyze the vast amount of SARS-CoV-2 sequences and determine the origin of SARS-CoV-2 VOC and lineages in populations worldwide.
As the scientific community continues to understand the vaccine and healthcare consequences of VOC, of crucial importance is to control and limit the spread of these VOC. Policymakers should first ascertain the source of the spread before they can control and limit the spread of future VOC. By understanding the source responsible for the highest number of cases, policy-makers can look at interactions between that area and the host area, identify the reasons for the spread, and address those with appropriate measures both in the present and future COVID-19 waves. After policy-makers contain the source, healthcare providers will treat patients with the most effective therapy, and vaccines will uphold their efficacy. However, without such precision public health genomics in practice, society risks losing progress in the fight against this pandemic. As only 5.4% of the global population is fully vaccinated against SARS-CoV-2 as of May 28, 2021 [42], the possibility of new infections, even in vaccinated populations, is ever present without appropriate measures to deter the spread and evolution of SARS-CoV-2. To exemplify this healthcare point, the B.1.427 and B.1.429 are resistant to specific monoclonal antibody (mAb) therapies [40,43]. Clinically, of treatment importance, the Food and Drug Administration and California Department of Health and Human Services have stopped using Bamlanivimab due to reduced clinical activity against the B.1.427 and B.1.429 variants [43]. Furthermore, the B.1.351 VOC demonstrates some resistance to mAbs, convalescent sera, and vaccine sera neutralizing antibodies, and the B.1.1.7 is resistant to some mAbs [44].
Much of Hawai'i was under a mandatory stay-at-home order (2020-03-22-2020-05-31) from the Governor of Hawai'i's Third [45] (2020-03-22) through Eighth [46] (2020-05-18) COVID-19 Proclamation's, with intermittent and inter-island quarantines continuing through the Thirteenth [47] (2020-09-23) Proclamation. Trans-Pacific travelers were no longer required to quarantine as of October 15, 2020 [48]. Of interest is that even though VOC were circulating worldwide, none entered Hawai'i during the stay-at-home order or intermittent quarantines; all entered after the conclusion of the Thirteenth Proclamation and the reinstatement of Trans-Pacific travel. Furthermore, the average time it took for the VOC to arrive into Hawai'i decreased from an average of 320 days to 132 days when compared between the quarantine and post-quarantine groups, respectively. The data here argues in favor of the success of the quarantine here in Hawai'i in preventing the influx of the SARS-CoV-2 VOC. Further, from 2020-03-11 to 2020-06-16, 43 states participated in a collective nationwide quarantine averaging 46 days per state, although there was no point in time where all 43 states were under simultaneous quarantine. We observed that even the seven states not participating in quarantine were able to significantly delay arrival of VOC, analogous to the "concept of herd immunity." Furthermore, in the post-collective-quarantine period (2020/06/16-2021/05/20), wherein no states were practicing quarantine, there was no significant difference in days to VOC arrival. This finding supports the data that quarantine (2020/03/11-2020/06/16) indeed delayed the arrival of VOC. While individual states benefit from separate quarantine, the data here indicate the success and impact of collective quarantine of the 43-states within the United States.
As the purpose of this article is to facilitate evidence based public-health policy, the vast number of VOC (194/228) entering Hawai'i from the West Coast of the United States is alarming and warrants policies directed at controlling the spread of the VOC in the Hawaiian Islands.
From the analysis of the SARS-CoV-2 sequence data, a policy-maker could reasonably consider focusing on additional screening, contact tracing, and quarantine efforts among visitors and residents arriving from and traveling to the West Coast of the continental United States. In terms of public policies and precision public health genomics, this information can direct funding into scientific studies evaluating the effects of mutations prevalent in specific populations, particularly mutations within genes that affect SARS-CoV-2 immunogenic epitopes. Policies should encourage research focusing on developing pseudoviruses [49]. and infectious clones [50] to evaluate kinetics, virulence, anatomical localization, transmission, and neutralization by mAbs, convalescent sera, and vaccine sera. Funding to conduct the aforementioned research should be directed at local and national levels.

Limitations
As SARS-CoV-2 sequences continue to be submitted retrospectively, these data will evolve. As a tool for precision public health genomics, the highest value is the trends that this method elucidates. The quarantine data will also change, this analysis is a snapshot in time.

Conclusions
These methods demonstrate the ability of precision public health genomics to identify the origin of SARS-CoV-2, the success of quarantine in Hawai'i, and the concern of emerging VOC. The conclusion from defining the origin of VOC in Hawai'i is that California is the primary source of VOC circulating in Hawai'i. Additional screening and quarantining of the travelers from California while vacationing in Hawai'i will protect the local population from evasive SARS-CoV-2 VOC. A tool was needed to evaluate and make use of the vast worldwide sequencing effort and the tool herein fills that need. Moreover, our methodology demonstrates the ability of sequencing and phylogenetic analysis to provide precision public health genomics in policy-making decisions. As SARS-CoV-2 VOC spreads asymptomatically across the United States and worldwide, it is essential to use fast and accurate SARS-CoV-2 VOC, lineage, and origin assignment for making evidence-based public-policy decisions. github.com/dpmaison/Genomic-Analysis-of-SARS-CoV-2-Variants-of-Concern-Circulatingin-Hawai-i-to-Facilitate-Public-Healt.